Forced Expiratory Volume (FEV) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data contain determinations of FEV for 654 children, ages 6 – 22, who were seen in a childhood respiratory disease study in 1980 in East Boston, Massachusetts. The data are part of a larger study following the change in pulmonary function over time in children.
| VariableName | Category | Description |
|---|---|---|
| ID | Numeric | Uniquely identifies each row |
| Age | Numeric | Age of the child |
| Height | Numeric | Height of the child in inches |
| Sex | Categorical | Sex of the child |
| Smoker | Categorical | Whether the child is a non-smoker or current smoker |
| FEV | Numeric | FEV of the child in litres |
Note: No citation required for this source (http://www.statsci.org/data/general/fev.html)
We chose this dataset out of personal interest: first, to see whether we can predict a child's FEV using the available predictors, rather than sending the child for a Pulmonary Function Test to identify pulmonary disease; and second, to analyze statistically which predictors are responsible for increasing or decreasing a child's pulmonary function, and to try to find answers along these lines.
Our aim is to predict a child's Forced Expiratory Volume (FEV) based on the child's Age, Height, Sex, and whether the child is a smoker. This could help kick-start the child's treatment based on the predicted FEV reading, rather than going through a full Pulmonary Function Test.
The goal is to find, after trying several methods, the model that predicts a child's FEV with minimal error. The best model is the one with the lowest Root Mean Square Error (RMSE) under leave-one-out cross-validation (LOOCV).
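For linear models, the LOOCV RMSE can be computed from a single fit using the leverage (hat-value) shortcut, so no refitting loop is needed. A minimal helper sketch:

```r
# For a linear model, the leave-one-out residual is e_i / (1 - h_ii),
# where h_ii is the i-th leverage, so the LOOCV RMSE needs no refitting.
rmse_loocv = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}
```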
We perform several data analyses and modeling methods in this section; each subsection describes its part of the task.
childfev = read.csv("http://www.statsci.org/data/general/fev.txt",
sep = "\t",
quote = "\"",
comment.char = "",
stringsAsFactors = FALSE)
childfev$Sex = as.factor(childfev$Sex)
childfev$Smoker = as.factor(childfev$Smoker)
str(childfev)
## 'data.frame': 654 obs. of 6 variables:
## $ ID : int 301 451 501 642 901 1701 1752 1753 1901 1951 ...
## $ Age : int 9 8 7 9 9 8 6 6 8 9 ...
## $ FEV : num 1.71 1.72 1.72 1.56 1.9 ...
## $ Height: num 57 67.5 54.5 53 57 61 58 56 58.5 60 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 1 1 1 1 1 ...
## $ Smoker: Factor w/ 2 levels "Current","Non": 2 2 2 2 2 2 2 2 2 2 ...
dim(childfev)
## [1] 654 6
head(childfev$FEV,10)
## [1] 1.708 1.724 1.720 1.558 1.895 2.336 1.919 1.415 1.987 1.942
From the dataset we observe that Age and Height are numerical variables, whereas Sex and Smoker are categorical variables. FEV is the numerical response variable.
pairs(childfev[c('Age','FEV','Height','Sex','Smoker')], col = "darkgrey")
The pairs plot shows clear signs of a linear relationship between FEV and Height. There is no equally clear sign of a linear relationship between FEV and Age. The two categorical variables, Sex and Smoker, each appear to split the data into two distinct groups. Let's explore this further.
Let's check the correlation matrix to explore the correlations among the numeric variables.
cor(childfev[c('Age','Height','FEV')])
## Age Height FEV
## Age 1.0000 0.7919 0.7565
## Height 0.7919 1.0000 0.8681
## FEV 0.7565 0.8681 1.0000
The correlation matrix supports what the pairs plot shows: Age and Height are both highly correlated with the FEV response, as well as with each other. We will need to examine the variance inflation factor (VIF) when building our models.
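As a sketch of that VIF check (assuming the `car` package is installed; it is not loaded elsewhere in this report), `car::vif()` reports variance inflation factors for a fitted model; for models containing factors it reports generalized VIFs:

```r
# Rule of thumb: a VIF above 5 (or 10) signals problematic collinearity.
library(car)
vif(lm(FEV ~ Age + Height + Sex + Smoker, data = childfev))
```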
table(childfev$Sex)
##
## Female Male
## 318 336
table(childfev$Smoker)
##
## Current Non
## 65 589
Let's start with Multiple Linear Regression (MLR) on all the predictors (except ID), with FEV as the response.
mlr_model = lm(FEV ~ Age + Height + Sex + Smoker, data = childfev)
summary(mlr_model)
##
## Call:
## lm(formula = FEV ~ Age + Height + Sex + Smoker, data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3766 -0.2503 0.0089 0.2559 1.9205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.54422 0.23205 -19.58 < 2e-16 ***
## Age 0.06551 0.00949 6.90 1.2e-11 ***
## Height 0.10420 0.00476 21.90 < 2e-16 ***
## SexMale 0.15710 0.03321 4.73 2.7e-06 ***
## SmokerNon 0.08725 0.05925 1.47 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.412 on 649 degrees of freedom
## Multiple R-squared: 0.775, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: <2e-16
Let's verify the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(mlr_model)~fitted(mlr_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(mlr_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(mlr_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, the constant variance assumption looks suspect: the residuals are not evenly spread around zero across the fitted values.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mlr_model)
##
## studentized Breusch-Pagan test
##
## data: mlr_model
## BP = 65, df = 4, p-value = 3e-13
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(mlr_model))
##
## Shapiro-Wilk normality test
##
## data: resid(mlr_model)
## W = 0.99, p-value = 0.0002
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(mlr_model))^2)
## [1] 110.3
RMSE
sqrt(mean(resid(mlr_model)^2))
## [1] 0.4106
RMSE LOOCV
sqrt(mean((resid(mlr_model)/(1-hatvalues(mlr_model)))^2))
## [1] 0.4145
| Metric | Value |
|---|---|
| RSS | 110.2796 |
| RMSE | 0.4106 |
| RMSE LOOCV | 0.4145 |
Let's explore whether a transformation of the response variable FEV has any significance in improving the score (a lower RMSE LOOCV).
Let's first plot the histogram of the FEV response variable with and without a logarithmic transformation.
par(mfrow=c(1,2))
hist(childfev$FEV,
xlab = 'FEV response Variable Data',
probability = TRUE,
breaks = 20,
col = "darkgrey",
main = "Original FEV response")
hist(log(childfev$FEV),
xlab = 'log(FEV) response Variable Data',
breaks = 20,
col = "darkgrey",
main = "Logarithmic FEV response")
It seems that even after the logarithmic transformation the FEV response does not look normally distributed. Hence, let's not explore the logarithmic transformation of the FEV response variable further.
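As an optional, complementary check (not part of the original analysis), the Box-Cox procedure from the `MASS` package profiles a power transformation λ of the response; a λ near 1 would support keeping FEV untransformed, while a λ near 0 would support the log:

```r
# Profile log-likelihood over candidate powers lambda; the peak and its
# 95% interval suggest a response transformation (MASS ships with R).
library(MASS)
boxcox(lm(FEV ~ Age + Height + Sex + Smoker, data = childfev),
       lambda = seq(-0.5, 1.5, by = 0.05))
```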
Let's explore polynomial transformations of the predictors to see whether we can find a better model that lowers the RMSE LOOCV.
quad_model = lm(FEV ~ Sex + Smoker + poly(Age,2) + poly(Height,2), data = childfev)
summary(quad_model)
##
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 2) + poly(Height,
## 2), data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6065 -0.2280 0.0049 0.2250 1.7957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4622 0.0550 44.79 < 2e-16 ***
## SexMale 0.0952 0.0329 2.90 0.0039 **
## SmokerNon 0.1396 0.0575 2.43 0.0154 *
## poly(Age, 2)1 4.9250 0.7568 6.51 1.5e-10 ***
## poly(Age, 2)2 0.5473 0.5435 1.01 0.3142
## poly(Height, 2)1 15.5887 0.7807 19.97 < 2e-16 ***
## poly(Height, 2)2 2.8550 0.4969 5.75 1.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.395 on 647 degrees of freedom
## Multiple R-squared: 0.794, Adjusted R-squared: 0.792
## F-statistic: 416 on 6 and 647 DF, p-value: <2e-16
The quadratic model is significant overall, with a very low p-value (< 2e-16), and all predictors are significant except the quadratic Age term, whose p-value is around 0.31.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(quad_model)~fitted(quad_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Quadratic")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(quad_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(quad_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(quad_model)
##
## studentized Breusch-Pagan test
##
## data: quad_model
## BP = 81, df = 6, p-value = 3e-15
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(quad_model))
##
## Shapiro-Wilk normality test
##
## data: resid(quad_model)
## W = 0.99, p-value = 1e-05
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(quad_model))^2)
## [1] 101
RMSE
sqrt(mean(resid(quad_model)^2))
## [1] 0.393
RMSE LOOCV
sqrt(mean((resid(quad_model)/(1-hatvalues(quad_model)))^2))
## [1] 0.398
| Metric | Value |
|---|---|
| RSS | 100.989 |
| RMSE | 0.393 |
| RMSE LOOCV | 0.398 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
cube_model = lm(FEV ~ Sex + Smoker + poly(Age,3) + poly(Height,3), data = childfev)
summary(cube_model)
##
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 3) + poly(Height,
## 3), data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6342 -0.2264 0.0074 0.2303 1.7741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4557 0.0556 44.19 < 2e-16 ***
## SexMale 0.0978 0.0333 2.94 0.0034 **
## SmokerNon 0.1453 0.0578 2.51 0.0122 *
## poly(Age, 3)1 4.8895 0.7570 6.46 2.1e-10 ***
## poly(Age, 3)2 0.7797 0.5626 1.39 0.1662
## poly(Age, 3)3 -0.7317 0.4667 -1.57 0.1174
## poly(Height, 3)1 15.6558 0.7817 20.03 < 2e-16 ***
## poly(Height, 3)2 2.4713 0.5545 4.46 9.8e-06 ***
## poly(Height, 3)3 0.3618 0.4253 0.85 0.3953
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.395 on 645 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.793
## F-statistic: 313 on 8 and 645 DF, p-value: <2e-16
The cubic model is significant overall, with a very low p-value (< 2e-16). However, the higher-order Age terms (p ≈ 0.17 and 0.12) and the cubic Height term (p ≈ 0.40) are not individually significant.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(cube_model)~fitted(cube_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Cubic")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(cube_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(cube_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(cube_model)
##
## studentized Breusch-Pagan test
##
## data: cube_model
## BP = 84, df = 8, p-value = 6e-15
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(cube_model))
##
## Shapiro-Wilk normality test
##
## data: resid(cube_model)
## W = 0.99, p-value = 1e-05
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(cube_model))^2)
## [1] 100.6
RMSE
sqrt(mean(resid(cube_model)^2))
## [1] 0.3922
RMSE LOOCV
sqrt(mean((resid(cube_model)/(1-hatvalues(cube_model)))^2))
## [1] 0.3984
| Metric | Value |
|---|---|
| RSS | 100.5859 |
| RMSE | 0.3922 |
| RMSE LOOCV | 0.3984 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
high_model = lm(FEV ~ Sex + Smoker + poly(Age,6) + poly(Height,6), data = childfev)
summary(high_model)
##
## Call:
## lm(formula = FEV ~ Sex + Smoker + poly(Age, 6) + poly(Height,
## 6), data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7062 -0.2302 0.0136 0.2248 1.7638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4557 0.0562 43.69 < 2e-16 ***
## SexMale 0.0974 0.0340 2.87 0.0043 **
## SmokerNon 0.1455 0.0586 2.48 0.0133 *
## poly(Age, 6)1 4.8282 0.7663 6.30 5.5e-10 ***
## poly(Age, 6)2 0.8241 0.5698 1.45 0.1486
## poly(Age, 6)3 -0.7679 0.4742 -1.62 0.1058
## poly(Age, 6)4 0.0457 0.4424 0.10 0.9177
## poly(Age, 6)5 0.5243 0.4282 1.22 0.2212
## poly(Age, 6)6 -0.1088 0.4211 -0.26 0.7962
## poly(Height, 6)1 15.7181 0.7934 19.81 < 2e-16 ***
## poly(Height, 6)2 2.3834 0.5640 4.23 2.7e-05 ***
## poly(Height, 6)3 0.3608 0.4649 0.78 0.4379
## poly(Height, 6)4 -0.1415 0.4348 -0.33 0.7450
## poly(Height, 6)5 -0.2765 0.4186 -0.66 0.5092
## poly(Height, 6)6 -0.8348 0.4050 -2.06 0.0397 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.395 on 639 degrees of freedom
## Multiple R-squared: 0.797, Adjusted R-squared: 0.793
## F-statistic: 180 on 14 and 639 DF, p-value: <2e-16
The degree-6 model is significant overall (p < 2e-16), but beyond the linear Age term and the linear and quadratic Height terms, the higher-order polynomial terms are not individually significant (apart from the degree-6 Height term, p ≈ 0.04).
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(high_model)~fitted(high_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Higher-Order Polynomial")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(high_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(high_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(high_model)
##
## studentized Breusch-Pagan test
##
## data: high_model
## BP = 92, df = 14, p-value = 2e-13
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(high_model))
##
## Shapiro-Wilk normality test
##
## data: resid(high_model)
## W = 0.99, p-value = 8e-06
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(high_model))^2)
## [1] 99.47
RMSE
sqrt(mean(resid(high_model)^2))
## [1] 0.39
RMSE LOOCV
sqrt(mean((resid(high_model)/(1-hatvalues(high_model)))^2))
## [1] 0.4012
| Metric | Value |
|---|---|
| RSS | 99.4667 |
| RMSE | 0.39 |
| RMSE LOOCV | 0.4012 |
It seems the model is not improving beyond degree 2; in fact, it deteriorates after degree 2. We will explore polynomials with interactions to see if there is any improvement. We will also create a bigger model and use AIC and BIC to explore whether we can improve it.
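One way to formalize "no improvement beyond degree 2" is a nested-model F test: the quadratic model is a special case of the cubic model fitted above, so `anova()` can test the extra degree-3 terms jointly. A sketch, assuming `quad_model` and `cube_model` from the fits above:

```r
# A large p-value here means the degree-3 terms do not significantly
# improve on the degree-2 model, supporting stopping at degree 2.
anova(quad_model, cube_model)
```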
Let's explore interactions between the predictors to see whether we can find a better model that lowers the RMSE LOOCV.
two_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 2, data = childfev)
summary(two_int_model)
##
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^2, data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4557 -0.2250 0.0074 0.2331 1.6724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.20138 1.42176 -0.14 0.887
## Age -0.38722 0.06897 -5.61 2.9e-08 ***
## Height 0.04381 0.02213 1.98 0.048 *
## SexMale -0.84442 0.49098 -1.72 0.086 .
## SmokerNon -0.09124 1.24797 -0.07 0.942
## Age:Height 0.00640 0.00101 6.32 5.0e-10 ***
## Age:SexMale 0.00431 0.01851 0.23 0.816
## Age:SmokerNon 0.06144 0.02379 2.58 0.010 *
## Height:SexMale 0.01555 0.00951 1.63 0.103
## Height:SmokerNon -0.00878 0.01961 -0.45 0.654
## SexMale:SmokerNon -0.03269 0.13175 -0.25 0.804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.391 on 643 degrees of freedom
## Multiple R-squared: 0.8, Adjusted R-squared: 0.797
## F-statistic: 257 on 10 and 643 DF, p-value: <2e-16
The two-way interaction model is significant overall (p < 2e-16). Among the interaction terms, Age:Height and Age:Smoker are significant, while the remaining interactions are not.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(two_int_model)~fitted(two_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Two-Way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(two_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(two_int_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(two_int_model)
##
## studentized Breusch-Pagan test
##
## data: two_int_model
## BP = 88, df = 10, p-value = 1e-14
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(two_int_model))
##
## Shapiro-Wilk normality test
##
## data: resid(two_int_model)
## W = 0.99, p-value = 0.0003
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(two_int_model))^2)
## [1] 98.14
RMSE
sqrt(mean(resid(two_int_model)^2))
## [1] 0.3874
RMSE LOOCV
sqrt(mean((resid(two_int_model)/(1-hatvalues(two_int_model)))^2))
## [1] 0.3964
| Metric | Value |
|---|---|
| RSS | 98.1409 |
| RMSE | 0.3874 |
| RMSE LOOCV | 0.3964 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
three_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3, data = childfev)
summary(three_int_model)
##
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^3, data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4180 -0.2285 0.0059 0.2288 1.6306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.94328 7.52441 0.92 0.35648
## Age -0.52738 0.55549 -0.95 0.34278
## Height -0.06022 0.11584 -0.52 0.60331
## SexMale -5.27953 3.09717 -1.70 0.08875 .
## SmokerNon -9.43485 7.46602 -1.26 0.20680
## Age:Height 0.00807 0.00856 0.94 0.34640
## Age:SexMale -0.45320 0.14850 -3.05 0.00237 **
## Age:SmokerNon 0.48508 0.55069 0.88 0.37873
## Height:SexMale 0.07826 0.04824 1.62 0.10524
## Height:SmokerNon 0.13170 0.11503 1.14 0.25266
## SexMale:SmokerNon 8.06596 2.70524 2.98 0.00298 **
## Age:Height:SexMale 0.00719 0.00213 3.38 0.00077 ***
## Age:Height:SmokerNon -0.00630 0.00849 -0.74 0.45858
## Age:SexMale:SmokerNon 0.00974 0.05426 0.18 0.85755
## Height:SexMale:SmokerNon -0.12258 0.04275 -2.87 0.00427 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.383 on 639 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.805
## F-statistic: 193 on 14 and 639 DF, p-value: <2e-16
The three-way interaction model is significant overall (p < 2e-16), though only a handful of individual terms are significant, notably Age:Sex, Sex:Smoker, Age:Height:Sex, and Height:Sex:Smoker.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(three_int_model)~fitted(three_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Three-Way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(three_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(three_int_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(three_int_model)
##
## studentized Breusch-Pagan test
##
## data: three_int_model
## BP = 97, df = 14, p-value = 2e-14
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(three_int_model))
##
## Shapiro-Wilk normality test
##
## data: resid(three_int_model)
## W = 0.99, p-value = 0.002
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(three_int_model))^2)
## [1] 93.89
RMSE
sqrt(mean(resid(three_int_model)^2))
## [1] 0.3789
RMSE LOOCV
sqrt(mean((resid(three_int_model)/(1-hatvalues(three_int_model)))^2))
## [1] 0.3907
| Metric | Value |
|---|---|
| RSS | 93.8929 |
| RMSE | 0.3789 |
| RMSE LOOCV | 0.3907 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
four_int_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 4, data = childfev)
summary(four_int_model)
##
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^4, data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4258 -0.2254 0.0028 0.2256 1.6281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.7105 13.3060 1.93 0.054 .
## Age -1.9541 1.0023 -1.95 0.052 .
## Height -0.3497 0.2051 -1.70 0.089 .
## SexMale -31.1727 15.4641 -2.02 0.044 *
## SmokerNon -28.3251 13.3328 -2.12 0.034 *
## Age:Height 0.0301 0.0155 1.95 0.052 .
## Age:SexMale 1.6018 1.2116 1.32 0.187
## Age:SmokerNon 1.9294 1.0083 1.91 0.056 .
## Height:SexMale 0.4734 0.2362 2.00 0.045 *
## Height:SmokerNon 0.4232 0.2057 2.06 0.040 *
## SexMale:SmokerNon 34.1547 15.5034 2.20 0.028 *
## Age:Height:SexMale -0.0241 0.0184 -1.31 0.192
## Age:Height:SmokerNon -0.0286 0.0156 -1.84 0.067 .
## Age:SexMale:SmokerNon -2.0718 1.2192 -1.70 0.090 .
## Height:SexMale:SmokerNon -0.5209 0.2370 -2.20 0.028 *
## Age:Height:SexMale:SmokerNon 0.0317 0.0186 1.71 0.088 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.383 on 638 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.805
## F-statistic: 181 on 15 and 638 DF, p-value: <2e-16
The four-way interaction model is significant overall (p < 2e-16), but most individual terms are only marginally significant, and the four-way interaction itself has p ≈ 0.09.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(four_int_model)~fitted(four_int_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Four-Way Interaction")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(four_int_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(four_int_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(four_int_model)
##
## studentized Breusch-Pagan test
##
## data: four_int_model
## BP = 98, df = 15, p-value = 3e-14
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(four_int_model))
##
## Shapiro-Wilk normality test
##
## data: resid(four_int_model)
## W = 0.99, p-value = 0.002
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(four_int_model))^2)
## [1] 93.47
RMSE
sqrt(mean(resid(four_int_model)^2))
## [1] 0.378
RMSE LOOCV
sqrt(mean((resid(four_int_model)/(1-hatvalues(four_int_model)))^2))
## [1] 0.3918
| Metric | Value |
|---|---|
| RSS | 93.4651 |
| RMSE | 0.378 |
| RMSE LOOCV | 0.3918 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
Let's create a big model with polynomials of degree 3 and three-way interactions (categorical-to-categorical, categorical-to-numeric, and numeric-to-numeric) and check its score. We will also reduce the model via AIC and BIC to see whether the score improves.
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + poly(Age,3) + poly(Height,3), data = childfev)
summary(big_model)
##
## Call:
## lm(formula = FEV ~ (Age + Height + Sex + Smoker)^3 + poly(Age,
## 3) + poly(Height, 3), data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2966 -0.2289 0.0028 0.2179 1.6014
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.09658 7.56454 1.07 0.28488
## Age -0.61839 0.56434 -1.10 0.27359
## Height -0.08093 0.11699 -0.69 0.48929
## SexMale -3.88732 3.17077 -1.23 0.22066
## SmokerNon -10.79496 7.65232 -1.41 0.15883
## poly(Age, 3)1 NA NA NA NA
## poly(Age, 3)2 -0.46787 1.14254 -0.41 0.68231
## poly(Age, 3)3 -0.18683 0.54924 -0.34 0.73385
## poly(Height, 3)1 NA NA NA NA
## poly(Height, 3)2 -0.13970 1.02831 -0.14 0.89198
## poly(Height, 3)3 -0.95598 0.46743 -2.05 0.04125 *
## Age:Height 0.00967 0.00876 1.10 0.26987
## Age:SexMale -0.62552 0.16678 -3.75 0.00019 ***
## Age:SmokerNon 0.63679 0.57480 1.11 0.26835
## Height:SexMale 0.06013 0.04922 1.22 0.22229
## Height:SmokerNon 0.15546 0.11798 1.32 0.18807
## SexMale:SmokerNon 7.71549 2.71796 2.84 0.00467 **
## Age:Height:SexMale 0.00958 0.00238 4.03 0.000063 ***
## Age:Height:SmokerNon -0.00886 0.00887 -1.00 0.31812
## Age:SexMale:SmokerNon 0.03389 0.05575 0.61 0.54353
## Height:SexMale:SmokerNon -0.12128 0.04282 -2.83 0.00477 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.383 on 635 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.805
## F-statistic: 151 on 18 and 635 DF, p-value: <2e-16
The big model is significant overall (p < 2e-16). Note that two polynomial terms are not defined because of singularities (they duplicate the linear Age and Height terms already in the model), and only a few interaction terms are individually significant.
Let's check the assumptions of the model: constant variance and normality.
Let's plot the residuals versus fitted values and a Q-Q plot to see the distributions.
par(mfrow = c(1,2))
plot(resid(big_model)~fitted(big_model), ylab = "residuals", xlab = "fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted Model - Big Model")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(big_model), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(big_model), lwd = 2, col = "darkorange")
From the residuals versus fitted plot, although this looks better than the Multiple Linear Regression fit, the model is not doing a good job at the higher fitted values: as the fitted values increase, the residuals spread wider, so the constant variance assumption is suspect.
The Q-Q plot tells a similar story; both tails look somewhat fat, which suggests the normality assumption is also suspect for this model.
Let's run the Breusch-Pagan (BP) test and Shapiro-Wilk test to gather more evidence.
bptest(big_model)
##
## studentized Breusch-Pagan test
##
## data: big_model
## BP = 100, df = 18, p-value = 4e-14
The BP test p-value is very low, which confirms that the constant variance assumption is suspect.
shapiro.test(resid(big_model))
##
## Shapiro-Wilk normality test
##
## data: resid(big_model)
## W = 0.99, p-value = 0.003
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and RMSE LOOCV of the model.
RSS
sum((resid(big_model))^2)
## [1] 93.05
RMSE
sqrt(mean(resid(big_model)^2))
## [1] 0.3772
RMSE LOOCV
sqrt(mean((resid(big_model)/(1-hatvalues(big_model)))^2))
## [1] 0.3925
| Metric | Value |
|---|---|
| RSS | 93.0501 |
| RMSE | 0.3772 |
| RMSE LOOCV | 0.3925 |
The RMSE LOOCV seems better than that of Multiple Linear Regression. We will compare all the models at the end when making a decision around them.
big_model = lm(FEV ~ (Age + Height + Sex + Smoker) ^ 3 + poly(Age,3) + poly(Height,3), data = childfev)
big_model_aic = step(big_model, direction = "backward", trace = 0)
summary(big_model_aic)
##
## Call:
## lm(formula = FEV ~ Age + Height + Sex + Smoker + poly(Height,
## 3) + Age:Height + Age:Sex + Age:Smoker + Height:Sex + Height:Smoker +
## Sex:Smoker + Age:Height:Sex + Height:Sex:Smoker, data = childfev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3035 -0.2190 -0.0021 0.2167 1.6067
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2954728 2.5352999 0.12 0.90726
## Age -0.0097005 0.1592017 -0.06 0.95143
## Height 0.0423046 0.0388440 1.09 0.27653
## SexMale -3.4622592 3.0293083 -1.14 0.25350
## SmokerNon -3.5027204 1.9922327 -1.76 0.07919 .
## poly(Height, 3)1 NA NA NA NA
## poly(Height, 3)2 0.0748272 0.8601746 0.09 0.93071
## poly(Height, 3)3 -0.9398997 0.4368687 -2.15 0.03182 *
## Age:Height 0.0000545 0.0024431 0.02 0.98221
## Age:SexMale -0.5785139 0.1482446 -3.90 0.00011 ***
## Age:SmokerNon 0.0734381 0.0245113 3.00 0.00284 **
## Height:SexMale 0.0480355 0.0467074 1.03 0.30414
## Height:SmokerNon 0.0408410 0.0304463 1.34 0.18026
## SexMale:SmokerNon 7.1999317 2.6065228 2.76 0.00591 **
## Age:Height:SexMale 0.0093680 0.0023237 4.03 0.000062 ***
## Height:SexMale:SmokerNon -0.1077336 0.0396057 -2.72 0.00670 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.382 on 639 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.806
## F-statistic: 195 on 14 and 639 DF, p-value: <2e-16
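For intuition on what `step(big_model, direction = "backward")` is doing: at each pass it removes the term whose deletion lowers AIC the most, stopping when no removal helps (R's `extractAIC` uses n·log(RSS/n) + 2k up to an additive constant). A minimal sketch of backward elimination on synthetic data, in Python for self-containment (illustration only; the `aic` helper and variable names are hypothetical, not part of the report's R pipeline):

```python
import numpy as np

def aic(X, y):
    # AIC up to an additive constant: n * log(RSS / n) + 2 * (number of columns)
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

rng = np.random.default_rng(0)
n = 200
x1, x2, noise = rng.normal(size=(3, n))
y = 1 + 2 * x1 - x2 + rng.normal(0, 0.5, size=n)

cols = {"intercept": np.ones(n), "x1": x1, "x2": x2, "noise": noise}
keep = list(cols)

# Backward elimination: repeatedly drop the term that most improves AIC
improved = True
while improved:
    improved = False
    current = aic(np.column_stack([cols[c] for c in keep]), y)
    drop = None
    for c in keep:
        if c == "intercept":
            continue
        trial = aic(np.column_stack([cols[k] for k in keep if k != c]), y)
        if trial < current:
            current, drop, improved = trial, c, True
    if improved:
        keep.remove(drop)

print(keep)  # the informative terms x1 and x2 survive the search
```

The 2k penalty is why the search can discard weak terms even though dropping them slightly increases the RSS.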
The overall model is highly significant (F-test p-value < 2e-16). Several interaction terms are strongly significant, although some main effects, such as Age (p ≈ 0.95), are not individually significant.

Let's verify the model assumptions (constant variance and normality) with a residuals-versus-fitted plot and a normal Q-Q plot.
par(mfrow = c(1,2))
plot(resid(big_model_aic) ~ fitted(big_model_aic), ylab = "Residuals", xlab = "Fitted", cex = 1, pch = 1, col = "darkgrey", main = "Residuals Versus Fitted - Big Model (AIC)")
abline(h = 0, lwd = 2, col = "darkorange")
qqnorm(resid(big_model_aic), cex = 1, pch = 1, col = "darkgrey")
qqline(resid(big_model_aic), lwd = 2, col = "darkorange")
In the residuals-versus-fitted plot, although this model does better than the multiple linear regression, it still struggles at higher fitted values: the residuals fan out as the fitted values increase, so the constant-variance assumption is suspect.

The Q-Q plot tells a similar story: both tails are somewhat heavy, which suggests the normality assumption is also suspect.
Let's run the Breusch-Pagan and Shapiro-Wilk tests to gather more evidence.
bptest(big_model_aic)
##
## studentized Breusch-Pagan test
##
## data: big_model_aic
## BP = 94, df = 14, p-value = 6e-14
The Breusch-Pagan p-value is very low, which confirms that the constant-variance assumption is suspect.
shapiro.test(resid(big_model_aic))
##
## Shapiro-Wilk normality test
##
## data: resid(big_model_aic)
## W = 0.99, p-value = 0.003
The Shapiro-Wilk p-value is also low, which confirms that the normality assumption is suspect.
Let's calculate the RSS, RMSE, and LOOCV RMSE of the model.

RSS
sum((resid(big_model_aic))^2)
## [1] 93.29
RMSE
sqrt(mean(resid(big_model_aic)^2))
## [1] 0.3777
RMSE LOOCV
sqrt(mean((resid(big_model_aic)/(1-hatvalues(big_model_aic)))^2))
## [1] 0.3897
| Metric | Value |
|---|---|
| RSS | 93.2936 |
| RMSE | 0.3777 |
| RMSE LOOCV | 0.3897 |
The LOOCV RMSE is lower than that of the multiple linear regression model. We will compare all models at the end before making a decision.
model_list = list(mlr = mlr_model, quad = quad_model, cube = cube_model,
                  high_poly = high_model, i2_way_int = two_int_model,
                  i3_way_int = three_int_model, i4_way_int = four_int_model,
                  big_model = big_model, big_model_aic = big_model_aic)
compare_model = data.frame(
  Model = names(model_list),
  No_Of_Parameters = sapply(model_list, function(m) length(coef(m))),
  RSS = sapply(model_list, function(m) sum(resid(m) ^ 2)),
  RMSE = sapply(model_list, function(m) sqrt(mean(resid(m) ^ 2))),
  RMSE_LOOCV = sapply(model_list, function(m) sqrt(mean((resid(m) / (1 - hatvalues(m))) ^ 2)))
)
| Model | No_Of_Parameters | RSS | RMSE | RMSE_LOOCV |
|---|---|---|---|---|
| mlr | 5 | 110.28 | 0.4106 | 0.4145 |
| quad | 7 | 100.99 | 0.3930 | 0.3980 |
| cube | 9 | 100.59 | 0.3922 | 0.3984 |
| high_poly | 15 | 99.47 | 0.3900 | 0.4012 |
| i2_way_int | 11 | 98.14 | 0.3874 | 0.3964 |
| i3_way_int | 15 | 93.89 | 0.3789 | 0.3907 |
| i4_way_int | 16 | 93.47 | 0.3780 | 0.3918 |
| big_model | 21 | 93.05 | 0.3772 | 0.3925 |
| big_model_aic | 16 | 93.29 | 0.3777 | 0.3897 |
From the comparison above, the best model appears to be the big model reduced via backward AIC (big_model_aic), which has the lowest LOOCV RMSE. Let's check